Students are encouraged to work together on homework. However, sharing, copying or providing any part of a homework solution or code is an infraction of the University’s rules on Academic Integrity. Any violation will be punished as severely as possible.
Final submissions must be uploaded to our Compass 2g site on the Homework page. No email, hardcopy, or late submissions will be accepted.
.zip file, named hw01_yourNetID.zip, which contains:
hw08_yourNetID.Rmd. For example hw08_dunger.Rmd.hw08_yourNetID.html. For example hw08_dunger.html.nutrition.csv..html file will be considered a “report” which is the material that will determine the majority of your grade. Be sure to visibly include all R code and output that is relevant to answering the exercises. (You do not need to include irrelevant code you tried that resulted in error or did not answer the question correctly.).Rmd file as a template, be sure to remove the directions section. Consider removing eval = FALSE from any code chunks provided in the template, if you would like to run that code as part of your assignment..Rmd file should be written such that, if it is placed in a folder with any data your are asked to import, it will knit properly without modification.R for each of the exercises.For this exercise we will use the data stored in epa2015.csv. It contains detailed descriptions of 4,411 vehicles manufactured in 2015 that were used for fuel economy testing as performed by the Environment Protection Agency. The variables in the dataset are:
Make - manufacturerModel - model of vehicleID - manufacturer defined vehicle identification number within EPA’s computer system (not a VIN number)disp - cubic inch displacement of test vehicletype - car, truck, or both (for vehicles that meet specifications of both car and truck, like smaller SUVs or crossovers)horse - rated horsepower, in foot-pounds per secondcyl - number of cylinderslockup - vehicle has transmission lockup; N or Ydrive - drivetrain system code
weight - test weight, in poundsaxleratio - axle rationvratio - n/v ratio (engine speed versus vehicle speed at 50 mph)THC - total hydrocarbons, in grams per mile (g/mi)CO - Carbon monoxide (a regulated pollutant), in g/miCO2 - Carbon dioxide (the primary byproduct of all fossil fuel combustion), in g/mimpg - fuel economy, in miles per gallon CO2 using both horse and type. In practice we would use many more predictors, but limiting ourselves to these two, one numeric and one factor, will allow us to create a number of plots.(a) Load the data, and check its structure using str(). Verify that type is a factor; if not, coerce it to be a factor.
(b) Make a scatterplot of CO2 versus horse. Use a different color point for each vehicle type. Which color is which type?
(c) Fit a SLR model with CO2 as the response and only horse as the predictor. Recreate your plot and add the fitted regression line. Comment on how well this line models the data. Give an estimate for the average change in CO2 for a one foot-pound per second increase in horse for a vehicle of type truck. Give a 95% prediction interval using this model for the CO2 of a Subaru Impreza Wagon, which is a vehicle with 148 horsepower and is considered type Both. (Interestingly, the dataset gives the wrong drivetrain for most Subarus in this dataset, as they are almost all listed as F, when they are in fact all-wheel drive.)
(d) Fit an additive multiple regression model with CO2 as the response and horse and type as the predictors. Recreate your plot and add the fitted regression “lines” with the same colors as their respective points. Comment on how well these lines model the data. Give an estimate for the average change in CO2 for a one foot-pound per second increase in horse for a vehicle of type truck. Give a 95% prediction interval using this model for the CO2 of a Subaru Impreza Wagon, which is a vehicle with 148 horsepower and is considered type Both.
(e) Fit an interaction multiple regression model with CO2 as the response and horse and type as the predictors. Recreate your plot and add the fitted regression “lines” with the same colors as their respective points. Comment on how well these lines model the data. Give an estimate for the average change in CO2 for a one foot-pound per second increase in horse for a vehicle of type truck. Give a 95% prediction interval using this model for the CO2 of a Subaru Impreza Wagon, which is a vehicle with 148 horsepower and is considered type Both.
(f) You will perform \(F\)-tests later in the exercise, but for now, based solely on the three previous plots, which model is preferred: SLR, additive, or interaction?
(g) Use an ANOVA \(F\)-test to compare the SLR and additive models. Based on this test and a significance level of \(\alpha = 0.01\), which model is preferred?
(h) Use an ANOVA \(F\)-test to compare the additive and interaction models. Based on this test and a significance level of \(\alpha = 0.01\), which model is preferred?
For this exercise we will use the data stored in hospital.csv. It contains a random sample of 580 seriously ill hospitalized patients from a famous study called “SUPPORT” (Study to Understand Prognoses Preferences Outcomes and Risks of Treatment). As the name suggests, the purpose of the study was to determine what factors affected or predicted outcomes, such as how long a patient remained in the hospital. The variables in the dataset are:
Days - Days to death or hospital dischargeAge - Age on day of hospital admissionSex - female or maleComorbidity - Patient diagnosed with more than one chronic diseaseEdYears - Years of educationEducation - Education level; high or lowIncome - Income level; high or lowCharges - Hospital charges, in dollarsCare - Level of care required; high or lowRace - Non-white or whitePressure - Blood pressure, in mmHgBlood - White blood cell count, in gm/dLRate - Heart rate, in bpmFor this exercise, we will use Charges, Pressure, Care, and Race to model Days.
(a) Load the data, and check its structure using str(). Verify that Care and Race are factors; if not, coerce them to be factors. What are the levels of Care and Race?
(b) Fit an additive multiple regression model with Days as the response using Charges, Pressure, Care, and Race as predictors. What does R choose as the reference level for Care and Race?
(c) Fit a multiple regression model with Days as the response. Use the main effects of Charges, Pressure, Care, and Race, as well as the interaction of Care with each of the numeric predictors as predictors. (that is, the interaction of Care with Charges and the interaction of Care with Pressure). Use a statistical test to compare this model to the additive model using a significance level of \(\alpha = 0.01\). Which do you prefer?
(d) Fit a multiple regression model with Days as the response. Use the predictors from the model in (c) as well as the interaction of Race with each of the numeric predictors. (that is, the interaction of Race with Charges and the interaction of Race with Pressure). Use a statistical test to compare this model to the additive model using a significance level of \(\alpha = 0.01\). Which do you prefer?
(e) Using the model in (d), give an estimate of the change in average Days for a one-unit increase in Pressure for a "white" patient that required a high level of care.
(f) Find a model using the four predictors that we have been considering that is more flexible than the model in (d) and that is also statistically significant as compared to the model in (d) at a significance level of \(\alpha = 0.01\).
For this exercise we will use the data stored in fish.csv. It contains data for 158 fish of 7 different species all gathered from the same lake in one season. The variables in the dataset are:
Species - Common name (Latin name)
Weight - Weight of the fish, in gramsLength1 - Length from the nose to the beginning of the tail, in cmLength2 - Length from the nose to the notch of the tail, in cmLength3 - Length from the nose to the end of the tail, in cmHeightPct - Maximal height as % of Length3WidthPct - Maximal width as % of Length3Sex - 0 = female, 1 = maleWe will attempt to predict Weight using Length1, HeightPct, and WidthPct.
(a) Use R to fit the model
\[ Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_1 x_2 + \beta_5 x_1 x_3 + \beta_6 x_2 x_3 + \beta_7 x_1 x_2 x_3 + \epsilon, \]
where
WeightLength1HeightPctWidthPct.Report the estimated coefficients of the model.
(b) Consider fitting a smaller model in R.
fish_smaller = lm(Weight ~ Length1 + HeightPct * WidthPct, data = fish)
Use a statistical test to compare this model with the previous. Report the following:
(c) Give an expression based on the model in (a) for the true change in average weight for a 1 cm increase in Length1 for a fish with a HeightPct of 20 and a WidthPct of 10.
(d) Give an expression based on the smaller model in (b) for the true change in average weight for a 1 cm increase in Length1 for a fish with a HeightPct of 20 and a WidthPct of 10.
In this exercise, we will try to convince ourselves that a two-sample \(t\)-test assuming equal variance is the same as a \(t\)-test for the coefficient in front of a single factor variable in a linear model.
First we setup the data frame that we will use throughout.
n = 16
ex4 = data.frame(
groups = c(rep("A", n / 2), rep("B", n / 2)),
values = rep(0, n))
str(ex4)
## 'data.frame': 16 obs. of 2 variables:
## $ groups: chr "A" "A" "A" "A" ...
## $ values: num 0 0 0 0 0 0 0 0 0 0 ...
We will use a total sample size of 16, 8 for each group. The groups variable splits the data into two groups, A and B, which will be the grouping variable for the \(t\)-test and a factor variable in a regression. The values variable will store simulated data.
We will repeat the following process a number of times.
ex4$values = rnorm(n, mean = 10, sd = 3) # simualte data
summary(lm(values ~ groups, data = ex4))
##
## Call:
## lm(formula = values ~ groups, data = ex4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.3590 -1.0837 0.3642 1.0391 5.5094
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.263 1.005 9.217 2.54e-07 ***
## groupsB 2.045 1.421 1.439 0.172
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.843 on 14 degrees of freedom
## Multiple R-squared: 0.1288, Adjusted R-squared: 0.06655
## F-statistic: 2.069 on 1 and 14 DF, p-value: 0.1723
t.test(values ~ groups, data = ex4, var.equal = TRUE)
##
## Two Sample t-test
##
## data: values by groups
## t = -1.4386, df = 14, p-value = 0.1723
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -5.093195 1.003801
## sample estimates:
## mean in group A mean in group B
## 9.263231 11.307928
We use lm() to test
\[ H_0: \beta_1 = 0 \]
for the model
\[ Y = \beta_0 + \beta_1 x_1 + \epsilon \]
where \(Y\) are the values of interest, and \(x_1\) is a dummy variable that splits the data in two. We will let R take care of the dummy variable.
We use t.test() to test
\[ H_0: \mu_A = \mu_B \]
where \(\mu_A\) is the mean for the A group, and \(\mu_B\) is the mean for the B group.
The following code sets up some variables for storage.
num_sims = 100
lm_t = rep(0, num_sims)
lm_p = rep(0, num_sims)
tt_t = rep(0, num_sims)
tt_p = rep(0, num_sims)
lm_t will store the test statistic for the test \(H_0: \beta_1 = 0\).lm_p will store the p-value for the test \(H_0: \beta_1 = 0\).tt_t will store the test statistic for the test \(H_0: \mu_A = \mu_B\).tt_p will store the p-value for the test \(H_0: \mu_A = \mu_B\).The variable num_sims controls how many times we will repeat this process, which we have chosen to be 100.
(a) Set a seed equal to your UIN. Then write code that repeats the above process 100 times. Each time, store the appropriate values in lm_t, lm_p, tt_t, and tt_p. Specifically, each time you should use ex4$values = rnorm(n, mean = 10, sd = 3) to update the data. The grouping will always stay the same.
(b) Report the value obtained by running mean(lm_t == tt_t), which tells us what proportion of the test statistics are equal. The result may be extremely surprising!
(c) Report the value obtained by running mean(lm_p == tt_p), which tells us what proportion of the p-values are equal. The result may be extremely surprising!
(d) If you have done everything correctly so far, your answers to the last two parts won’t indicate the equivalence we want to show! What the heck is going on here? The first issue is one of using a computer to do calculations. When a computer checks for equality, it demands equality; nothing can be different. However, when a computer performs calculations, it can only do so with a certain level of precision. So if we calculate two quantities we know to be analytically equal, they can differ numerically. Instead of mean(lm_p == tt_p) run all.equal(lm_p, tt_p). This will perform a similar calculation, but with a very small error tolerance for each equality. What is the result of running this code? What does it mean?
(e) Your answer in (d) should now make much more sense. Then what is going on with the test statistics? Take a look at the values stored in lm_t and tt_t. What do you notice? Is there a relationship between the two? Can you explain why this is happening?